AITopics | coreset construction algorithm

Country: North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)

Trevor Campbell, Boyan Beronov

Sparse Variational Inference: Bayesian Coresets from Scratch

Neural Information Processing Systems, Oct-3-2025, 01:43:13 GMT

The proliferation of automated inference algorithms in Bayesian statistics has provided practitioners newfound access to fast, reproducible data analysis and powerful statistical models.

artificial intelligence, coreset construction, machine learning, (13 more...)

Country:

Europe > United Kingdom (0.14)
North America > Canada > British Columbia > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.94)

Neural Information Processing Systems, Jan-25-2025, 03:52:26 GMT

Reviews: Coresets for Clustering with Fairness Constraints

This paper introduces a new coreset construction mechanism for fair clustering in which the points can be of multiple disjoint types. As in classic fair clustering, the goal of this work is to construct a clustering in which the types represented in each cluster are balanced. Unlike previous work, the focus here is on constructing the clustering efficiently via coresets. This work provides a coreset construction algorithm for fair k-median (previously unknown) and improves the previously known coreset construction algorithm for fair k-means. In addition to theoretical contributions with respect to coreset size and construction time, the authors also provide a small empirical study.

coreset, coreset construction algorithm, fairness constraint, (11 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.41)

Neural Information Processing Systems, Mar-13-2024, 17:51:02 GMT

Distributed k-Means and k-Median Clustering on General Topologies

This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.

algorithm, communication cost, coreset, (14 more...)

Country: North America > United States > Georgia > Fulton County > Atlanta (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

arXiv.org Machine Learning, Feb-15-2020

On Coresets for Support Vector Machines

Tukan, Murad, Baykal, Cenk, Feldman, Dan, Rus, Daniela

We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that models trained on the coreset are provably competitive with those trained on the original data set. Since the size of the coreset is generally much smaller than the original set, our preprocess-then-train scheme has potential to lead to significant speedups when training SVM models. We prove lower and upper bounds on the size of the coreset required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
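To make the idea of a weighted data summary concrete, here is a minimal sketch in Python. It is not the construction from the paper: it uses the simplest possible lossless summary, collapsing repeated points into one weighted point each, and checks that the weighted soft-margin SVM objective on the summary matches the objective on the full data. The function names, the toy data, and the regularization parameter `reg` are all assumptions made for illustration.

```python
from collections import Counter


def weighted_svm_objective(w, b, points, labels, weights, reg=1.0):
    """Soft-margin SVM objective: 0.5*||w||^2 + reg * sum_i weight_i * hinge_i."""
    hinge_total = 0.0
    for x, y, u in zip(points, labels, weights):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        hinge_total += u * max(0.0, 1.0 - margin)
    return 0.5 * sum(wi * wi for wi in w) + reg * hinge_total


def dedup_coreset(points, labels):
    """Illustrative summary: collapse repeated (point, label) pairs into weighted points."""
    counts = Counter(zip(map(tuple, points), labels))
    core_points = [p for p, _ in counts]
    core_labels = [y for _, y in counts]
    core_weights = [float(c) for c in counts.values()]
    return core_points, core_labels, core_weights


# Toy data with repeated points (hypothetical values).
points = [(1.0, 2.0), (1.0, 2.0), (-1.0, -0.5), (-1.0, -0.5), (-1.0, -0.5), (0.2, 0.1)]
labels = [1, 1, -1, -1, -1, 1]
core_points, core_labels, core_weights = dedup_coreset(points, labels)

# Evaluate the same candidate separator on the full data and on the summary.
w, b = (0.3, -0.2), 0.1
full = weighted_svm_objective(w, b, points, labels, [1.0] * len(points))
small = weighted_svm_objective(w, b, core_points, core_labels, core_weights)
```

A real coreset is lossy: it keeps far fewer points than the number of distinct inputs and only guarantees that the weighted objective stays close to the full one for every candidate separator. The weighted objective above is the piece an off-the-shelf solver would need to accept per-sample weights.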

algorithm, coreset, sensitivity, (14 more...)

2002.06469

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)

arXiv.org Machine Learning, Apr-11-2019

Robust Coreset Construction for Distributed Machine Learning

Lu, Hanlin, Li, Ming-Ju, He, Ting, Wang, Shiqiang, Narayanan, Vijay, Chan, Kevin S

Motivated by the need to solve machine learning problems over distributed datasets, we explore the use of coresets to reduce the communication overhead. A coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, a coreset has the advantage that it can be used as a proxy for the original dataset, potentially for different applications. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. Thus, to solve different machine learning problems, one has to collect coresets of different types, defeating the purpose of saving communication overhead. We resolve this dilemma by developing coreset construction algorithms based on k-means/median clustering that give a provably good approximation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the robust performance of the proposed algorithms.
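As a rough illustration of a clustering-based summary, the sketch below runs plain Lloyd's k-means and returns the cluster centers as a small weighted set, with each weight equal to the number of points assigned to that center. This is an assumed, simplified reading of "coreset construction based on k-means clustering", not the authors' algorithm; the function names, the iteration count, and the synthetic data are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c])) for p in points]


def kmeans_coreset(points, k, iters=20, seed=0):
    """Return (centers, weights): k-means centers weighted by their cluster sizes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct positions in the data
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = assign(points, centers)
    weights = [float(labels.count(j)) for j in range(k)]
    return centers, weights


# Synthetic two-blob dataset (hypothetical).
rng = random.Random(1)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
data += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(200)]

centers, weights = kmeans_coreset(data, k=4)
```

In a distributed setting, a node would transmit only `centers` and `weights` in place of `data`, and a downstream learner would treat each center as `weights[j]` copies of that point.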

algorithm, artificial intelligence, machine learning, (17 more...)

1904.05961

Country:

North America > United States > Pennsylvania > Centre County > University Park (0.04)
North America > United States > Maryland > Prince George's County > Adelphi (0.04)

Genre:

Research Report (0.81)
Overview (0.67)

Industry:

Education > Focused Education > Special Education (1.00)
Government (0.68)
Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Campbell, Trevor, Broderick, Tamara

Automated Scalable Bayesian Inference via Hilbert Coresets

arXiv.org Machine Learning, Oct-13-2017

The automation of posterior inference in Bayesian data analysis has enabled experts and nonexperts alike to use more sophisticated models, engage in faster exploratory modeling and analysis, and ensure experimental reproducibility. However, standard automated posterior inference algorithms are not tractable at the scale of massive modern datasets, and modifications to make them so are typically model-specific, require expert tuning, and can break theoretical guarantees on inferential quality. Building on the Bayesian coresets framework, this work instead takes advantage of data redundancy to shrink the dataset itself as a preprocessing step, providing fully-automated, scalable Bayesian inference with theoretical guarantees. We begin with an intuitive reformulation of Bayesian coreset construction as sparse vector sum approximation, and demonstrate that its automation and performance-based shortcomings arise from the use of the supremum norm. To address these shortcomings we develop Hilbert coresets, i.e., Bayesian coresets constructed under a norm induced by an inner-product on the log-likelihood function space. We propose two Hilbert coreset construction algorithms (one based on importance sampling, and one based on the Frank-Wolfe algorithm), along with theoretical guarantees on approximation quality as a function of coreset size. Since the exact computation of the proposed inner-products is model-specific, we automate the construction with a random finite-dimensional projection of the log-likelihood functions. The resulting automated coreset construction algorithm is simple to implement, and experiments on a variety of models with real and synthetic datasets show that it provides high-quality posterior approximations and a significant reduction in the computational cost of inference.
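The "sparse vector sum approximation" view can be sketched in a few lines. The example below assumes each data point's log-likelihood function has already been mapped to a finite-dimensional vector (standing in for the random projection mentioned in the abstract), then draws indices with probability proportional to each vector's norm and reweights them so the weighted sum is an unbiased estimate of the full sum. This is the standard importance-sampling estimator, offered as an illustration and not as the paper's exact procedure; `importance_sampling_coreset`, `m`, and the random vectors are hypothetical.

```python
import math
import random


def importance_sampling_coreset(vectors, m, seed=0):
    """Sample m indices with probability proportional to vector norm.

    vectors[n] is an assumed finite-dimensional stand-in for the n-th
    log-likelihood function. Returns {index: weight} with at most m entries.
    """
    rng = random.Random(seed)
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vectors]
    total = sum(norms)
    draws = rng.choices(range(len(vectors)), weights=norms, k=m)
    weights = {}
    for n in draws:
        # Each draw contributes total / (norm_n * m), which makes the
        # weighted sum of vectors unbiased for the full sum.
        weights[n] = weights.get(n, 0.0) + total / (norms[n] * m)
    return weights


# Hypothetical projected log-likelihood vectors for 1000 data points.
rng = random.Random(2)
vectors = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(1000)]

coreset = importance_sampling_coreset(vectors, m=50)
```

Only the indices in `coreset` need to be visited during inference, each with its log-likelihood scaled by the stored weight. By construction, the weighted norms of the selected vectors sum exactly to the total norm of all vectors.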

artificial intelligence, bayesian inference, machine learning, (18 more...)

1710.05053

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > District of Columbia > Washington (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Balcan, Maria-Florina F., Ehrlich, Steven, Liang, Yingyu

Distributed k-means and k-median Clustering on General Topologies

Neural Information Processing Systems, Dec-31-2013

algorithm, artificial intelligence, machine learning, (15 more...)

Country: North America > United States > Georgia > Fulton County > Atlanta (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Balcan, Maria Florina, Ehrlich, Steven, Liang, Yingyu

Distributed k-Means and k-Median Clustering on General Topologies

arXiv.org Machine Learning, Oct-30-2013

This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by Har-Peled and Mazumdar (2004), we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.

algorithm, coreset, graph, (10 more...)

1306.0604

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.89)